Text-to-speech voice adaptation from sparse training data
Authors
Abstract
Voice adaptation is the process of converting the output of a text-to-speech synthesizer so that it sounds like a different voice, after training on only a small amount of the desired target speaker’s speech. We employ a locally linear conversion function based on Gaussian mixture models to map Bark-scaled line spectral frequencies. We compare performance for three different estimation methods while varying the number of mixture components and the amount of data used for training. An objective evaluation revealed that all three methods yield similar test results. In perceptual tests, listeners judged the converted speech to be of acceptable quality and fairly successful in adapting to the target speaker.
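The abstract does not spell out the conversion function, so the following is only a minimal sketch of one common form of a GMM-based locally linear mapping: each mixture component contributes its own linear regression from source to target features, and the contributions are weighted by the component posteriors given the source vector. Diagonal covariances are assumed here purely to keep the example dependency-free. All function and parameter names (`convert`, `means_x`, `covs_yx`, and so on) are hypothetical, and the toy vectors merely stand in for Bark-scaled line spectral frequencies.

```python
import math


def gaussian_diag_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means_x, vars_x):
    """Posterior probability of each mixture component given source vector x."""
    logs = [
        math.log(w) + gaussian_diag_logpdf(x, m, v)
        for w, m, v in zip(weights, means_x, vars_x)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def convert(x, weights, means_x, vars_x, means_y, covs_yx):
    """Posterior-weighted sum of per-component linear regressions.

    Assumption: diagonal covariances, so each dimension d is mapped as
    mean_y[d] + (cov_yx[d] / var_x[d]) * (x[d] - mean_x[d]).
    """
    post = posteriors(x, weights, means_x, vars_x)
    y = [0.0] * len(x)
    for p, mx, vx, my, cyx in zip(post, means_x, vars_x, means_y, covs_yx):
        for d in range(len(x)):
            y[d] += p * (my[d] + cyx[d] / vx[d] * (x[d] - mx[d]))
    return y


# Toy two-component, two-dimensional model (made-up numbers, not from the paper).
weights = [0.5, 0.5]
means_x = [[0.0, 0.0], [4.0, 4.0]]
vars_x = [[1.0, 1.0], [1.0, 1.0]]
means_y = [[1.0, 1.0], [6.0, 6.0]]
covs_yx = [[0.5, 0.5], [0.8, 0.8]]

converted = convert([2.0, 2.0], weights, means_x, vars_x, means_y, covs_yx)
print(converted)
```

In practice the model parameters would be estimated from time-aligned source and target frames, and full covariance matrices would replace the per-dimension ratio with a matrix product; neither step is shown in this sketch.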
Similar resources
Voice-based Age and Gender Recognition using Training Generative Sparse Model
Abstract: Gender recognition and age detection are important problems in telephone speech processing to investigate the identity of an individual using voice characteristics. In this paper a new gender and age recognition system is introduced based on generative incoherent models learned using sparse non-negative matrix factorization and atom correction post-processing method. Similar to genera...
A New Method for Speech Enhancement Based on Incoherent Model Learning in Wavelet Transform Domain
The quality of a speech signal degrades significantly in the presence of environmental noise, which impairs the performance of hearing aid devices, automatic speech recognition systems, and mobile phones. In this paper, single-channel enhancement of speech corrupted by additive noise is considered. A dictionary-based algorithm is proposed to train the speech...
MIMIC: a voice-adaptive phonetic-tree speech synthesiser
This paper presents Mimic, a decision-tree-based concatenative voice-adaptive text-to-speech synthesiser. Mimic integrates text-to-speech synthesis (TTS) with speech recognition and speaker adaptation. Speech is synthesised by concatenating triphone synthesis units. The triphone units are obtained from clusters of training examples modelled, labelled and segmented using clustered HMMs and...
Personalizing a speech synthesizer by voice adaptation
A voice adaptation system enables users to quickly create new voices for a text-to-speech system, allowing for the personalization of the synthesis output. The system adapts to the pitch and spectrum of the target speaker, using a probabilistic, locally linear conversion function based on a Gaussian Mixture Model. Numerical and perceptual evaluations reveal insights into the correlation between...
Text-to-speech synthesis with arbitrary speaker's voice from average voice
This paper describes a technique for synthesizing speech with any desired voice. The technique is based on an HMM-based text-to-speech (TTS) system and the MLLR adaptation algorithm. To generate speech of an arbitrarily given target speaker, speaker-independent speech units, i.e., average voice models, are adapted to the target speaker using the MLLR framework. In addition to spectrum and pitch adaptati...